Overview: LinkedIn Post Analysis
This project applies Natural Language Processing (NLP) techniques to analyze my LinkedIn posts from the past year. The primary goal is to combine data collection, text pre-processing, and statistical analysis to uncover how my word usage and content patterns have evolved.
The workflow includes:
Data acquisition – retrieving my LinkedIn posts and associated metadata.
Post type distribution – plotting the frequency of different post categories over time.
Text pre-processing – tokenizing text, removing stop words, filtering out INSTANT_SHARE reposts, stripping links, and producing a tidy dataset with one token per row.
Word frequency analysis – calculating unigram and bigram counts.
Change-over-time analysis – measuring shifts in word usage using slope calculations for both unigrams and bigrams.
This end-to-end process demonstrates how NLP can be applied to personal social media data for content strategy insights and longitudinal language trends.
Loading the Data
# A tibble: 6 × 4
urn text created_at type
<chr> <chr> <dttm> <chr>
1 urn:li:activity:7361477882038099968 "3 𝐰𝐚𝐲𝐬 𝐈 𝐢𝐧𝐯𝐞𝐬… 2025-08-13 14:24:56 TEXT
2 urn:li:activity:7360783785690374145 "Last semester,… 2025-08-11 16:26:51 CELE…
3 urn:li:activity:7360163662294048769 <NA> 2025-08-09 23:22:42 INST…
4 urn:li:activity:7359412088483504130 "🎯 Graduating S… 2025-08-07 21:36:13 UNKN…
5 urn:li:activity:7359252038888669188 "One habit that… 2025-08-07 11:00:14 TEXT
6 urn:li:activity:7358345640961024000 "I just pulled … 2025-08-04 22:58:32 IMAGE
POSIXct[1:72], format: "2025-08-13 14:24:56" "2025-08-11 16:26:51" "2025-08-09 23:22:42" ...
Post Type Distribution
Let’s look at the distribution of posts over time.
Post Types Over Time
Let’s track how different post types evolved over time.
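A minimal sketch of how this chart could be built, assuming the loaded tibble is named `posts` (the object name is illustrative):

```r
library(dplyr)
library(ggplot2)
library(lubridate)

# Count posts per month and type, then stack the counts.
# `posts` has columns urn, text, created_at, type (see the tibble above).
posts %>%
  mutate(month = floor_date(created_at, "month")) %>%
  count(month, type) %>%
  ggplot(aes(month, n, fill = type)) +
  geom_col(position = "stack") +
  labs(x = NULL, y = "Posts", fill = "Post type",
       title = "Post Types Over Time")
```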
I kicked off my LinkedIn journey by sharing the certificates I was earning from Coursera, edX, LinkedIn Learning, and Udemy. In January 2025 I made a simple promise to post consistently. After finishing a Personal Selling course I stopped leaning on quick shares and started writing my own updates. By March the shift was obvious. Images, text posts, and document carousels became the core of my feed, while instant reposts faded into the background. Over the summer I added video to widen the story I can tell. The result is a steady move from announcements to original content that teaches, documents, and shows my work. Next, I will keep the cadence, scale the formats that land, and use video more often to bring projects to life.
Text Preprocessing Pipeline
This section implements a text preprocessing pipeline tailored to LinkedIn post analysis. The steps are sequenced to transform raw social media text into clean, analyzable tokens while preserving meaningful content and removing noise.
Preprocessing Strategy Overview
1. Data Filtering: We begin by filtering out INSTANT_SHARE posts, which are typically reposts or shares that don’t contain original content. This ensures our analysis focuses on authentic user-generated content.
2. Link Removal: External links are identified and removed using regex patterns that match common URL formats (http/https, www, linkedin.com, etc.). This prevents links from being treated as meaningful text tokens while preserving the semantic content of the posts.
3. Text Normalization: The text undergoes several normalization steps including conversion to lowercase, removal of extra whitespace, and handling of special characters to ensure consistent tokenization.
4. Tokenization: Text is broken down into individual words using the unnest_tokens() function, which handles punctuation, contractions, and word boundaries appropriately for social media text.
5. Stopword Removal: Common English stopwords are removed to focus analysis on content-bearing words. This includes articles, prepositions, common verbs, and other function words that don’t contribute to topic analysis. One caveat: contractions typed with curly apostrophes, such as “i’m” and “i’ve”, slip past this step because the stop word lexicon uses straight apostrophes, which is why they surface in the frequency tables below.
6. Hashtag Preservation: Hashtags are intentionally preserved as they often contain valuable topic and sentiment information specific to social media content.
7. Quality Filtering: Final filtering removes very short tokens (fewer than three characters) and ensures we have meaningful content for analysis.
Implementation Details
The preprocessing pipeline is implemented using tidytext principles, ensuring that each step produces a clean, structured dataset ready for downstream analysis. The pipeline maintains data integrity by preserving post metadata (timestamps, engagement metrics) while transforming only the text content.
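The pipeline described above might look roughly like this in tidytext; the object names (`posts`, `tidy_posts`) and the URL regex are illustrative, and note that `unnest_tokens()` strips punctuation by default, so keeping the literal `#` prefix on hashtags would require a custom token regex:

```r
library(dplyr)
library(stringr)
library(tidytext)

# Illustrative pattern for the common URL formats mentioned above
url_pattern <- "(https?://\\S+|www\\.\\S+|\\S*linkedin\\.com\\S*)"

tidy_posts <- posts %>%
  filter(type != "INSTANT_SHARE", !is.na(text)) %>%   # 1. drop reposts
  mutate(text = str_remove_all(text, url_pattern),    # 2. strip links
         text = str_squish(str_to_lower(text))) %>%   # 3. normalize
  unnest_tokens(word, text) %>%                       # 4. one token per row
  anti_join(stop_words, by = "word") %>%              # 5. remove stopwords
  filter(nchar(word) >= 3)                            # 7. drop short tokens
```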
Posts after filtering INSTANT_SHARE: 65
Total tokens after preprocessing: 3460
Average words per post: 53.2
# A tibble: 20 × 3
created_at word type
<dttm> <chr> <chr>
1 2025-08-13 14:24:56 𝐰𝐚𝐲𝐬 TEXT
2 2025-08-13 14:24:56 𝐢𝐧𝐯𝐞𝐬𝐭𝐞𝐝 TEXT
3 2025-08-13 14:24:56 𝐠𝐫𝐨𝐰𝐭𝐡 TEXT
4 2025-08-13 14:24:56 𝐝𝐚𝐭𝐚 TEXT
5 2025-08-13 14:24:56 𝐚𝐧𝐚𝐥𝐲𝐬𝐭 TEXT
6 2025-08-13 14:24:56 𝐈𝐧𝐯𝐞𝐬𝐭𝐢𝐧𝐠 TEXT
7 2025-08-13 14:24:56 𝐬𝐞𝐜𝐨𝐧𝐝 TEXT
8 2025-08-13 14:24:56 𝐬𝐜𝐫𝐞𝐞𝐧 TEXT
9 2025-08-13 14:24:56 semester TEXT
10 2025-08-13 14:24:56 msba TEXT
11 2025-08-13 14:24:56 program TEXT
12 2025-08-13 14:24:56 chad TEXT
13 2025-08-13 14:24:56 birger TEXT
14 2025-08-13 14:24:56 piece TEXT
15 2025-08-13 14:24:56 advice TEXT
16 2025-08-13 14:24:56 forget TEXT
17 2025-08-13 14:24:56 monitor TEXT
18 2025-08-13 14:24:56 message TEXT
19 2025-08-13 14:24:56 time TEXT
20 2025-08-13 14:24:56 saved TEXT
Word Frequency Analysis
Now I’ll analyze the most frequent words in my LinkedIn posts to understand my content patterns and key themes.
Word Frequency Calculation
I’ll calculate the frequency of each word across all my posts, which will help identify my most commonly used terms and topics.
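The frequency table below can be produced with a short dplyr chain, assuming the tidy token table from the preprocessing step is named `tidy_posts`:

```r
library(dplyr)

# Count each word, then express counts as a share of all tokens
word_freq <- tidy_posts %>%
  count(word, sort = TRUE) %>%
  mutate(total_words = sum(n),
         frequency   = n / total_words)
```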
# A tibble: 20 × 4
word n total_words frequency
<chr> <int> <int> <dbl>
1 data 84 3460 0.0243
2 analytics 32 3460 0.00925
3 business 27 3460 0.00780
4 learning 26 3460 0.00751
5 real 26 3460 0.00751
6 i’m 24 3460 0.00694
7 project 24 3460 0.00694
8 i’ve 21 3460 0.00607
9 time 18 3460 0.00520
10 engineering 17 3460 0.00491
11 building 16 3460 0.00462
12 sql 16 3460 0.00462
13 skills 15 3460 0.00434
14 spark 15 3460 0.00434
15 bootcamp 14 3460 0.00405
16 built 14 3460 0.00405
17 experience 14 3460 0.00405
18 it’s 14 3460 0.00405
19 journey 14 3460 0.00405
20 excited 13 3460 0.00376
Word Frequency Visualization
I’ll create a visualization to show my most frequent words and their relative frequencies.
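One way to draw this, assuming the frequency table above is named `word_freq`:

```r
library(dplyr)
library(ggplot2)

# Horizontal bar chart of the 20 most frequent words
word_freq %>%
  slice_max(n, n = 20) %>%
  ggplot(aes(n, reorder(word, n))) +
  geom_col() +
  labs(x = "Count", y = NULL,
       title = "Top 20 Words in My LinkedIn Posts")
```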
My language clusters around three things. First, the core theme is clear, with “data,” “analytics,” and “business” leading the list. Second, the frequent “I’m” and “I’ve” signal a first-person voice that tells a personal story rather than posting generic updates. Third, tool names like SQL and Spark show that I share hands-on work. Words such as “project,” “learning,” “building,” and “journey” reinforce a build-in-public approach where I document progress and lessons for others.
Bigram Analysis
Since I use many professional phrases like “data analyst”, “business analytics”, and “data engineering”, I’ll analyze bigrams to capture these important two-word combinations and track how they evolved over time.
Bigram Frequency Calculation
# A tibble: 20 × 4
bigram n total_bigrams frequency
<chr> <int> <int> <dbl>
1 data engineering 14 1611 0.00869
2 data science 9 1611 0.00559
3 real world 9 1611 0.00559
4 zach wilson 9 1611 0.00559
5 engineering bootcamp 7 1611 0.00435
6 time series 5 1611 0.00310
7 blog post 4 1611 0.00248
8 business analytics 4 1611 0.00248
9 chad birger 4 1611 0.00248
10 data professionals 4 1611 0.00248
11 game changer 4 1611 0.00248
12 series forecasting 4 1611 0.00248
13 technical skills 4 1611 0.00248
14 beacom school 3 1611 0.00186
15 data analyst 3 1611 0.00186
16 data analytics 3 1611 0.00186
17 data modeling 3 1611 0.00186
18 data warehouse 3 1611 0.00186
19 dataengineering analytics 3 1611 0.00186
20 doordash delivery 3 1611 0.00186
Bigram Visualization
The bootcamp with Zach Wilson sparked my build-in-public habit and it shows up clearly. “Data engineering” is the dominant phrase, followed closely by “zach wilson,” “engineering bootcamp,” and “real world,” which frames my posts around practical work rather than theory. The next cluster points to my focus areas in data science. “Time series,” “series forecasting,” and “technical skills” highlight the topics I study and share. Mentions of “data professionals,” “business analytics,” and “data analyst” reflect how I position myself in the community, while “blog post” captures the way I document progress. References to “Chad Birger” and “Beacom School” anchor the story in my USD network and mentors.
Wordcloud Comparison: Unigrams vs Bigrams
I’ll create mirrored wordclouds to visually compare my most frequent individual words against my most frequent two-word phrases, providing an intuitive way to see the difference between single terms and professional phrases.
Side-by-Side Comparison
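A minimal sketch of the side-by-side clouds with the wordcloud package, assuming the frequency tables are named `word_freq` and `bigram_freq`:

```r
library(wordcloud)

# Two base-graphics panels: unigrams on the left, bigrams on the right
par(mfrow = c(1, 2))
with(word_freq,   wordcloud(word,   n, max.words = 50, scale = c(3, 0.5)))
with(bigram_freq, wordcloud(bigram, n, max.words = 30, scale = c(2.5, 0.5)))
```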
Single words cluster around data, analytics, learning, and projects, while bigrams spotlight data engineering, real world work, Zach Wilson, and time series. Together they confirm a shift toward original, applied content built in public.
Changes in Word Usage Over Time
I’ll analyze how my word usage has changed over time by creating monthly time bins and tracking word frequency changes. This will help identify which terms became more or less common in my LinkedIn posts throughout the year.
Creating Time-Based Word Counts
# A tibble: 363 × 5
time_floor word count time_total word_total
<dttm> <chr> <int> <int> <int>
1 2024-11-01 00:00:00 analysis 1 26 6
2 2024-11-01 00:00:00 data 2 26 84
3 2024-11-01 00:00:00 i’m 2 26 24
4 2024-11-01 00:00:00 i’ve 2 26 21
5 2024-11-01 00:00:00 real 1 26 26
6 2024-11-01 00:00:00 science 1 26 9
7 2024-11-01 00:00:00 share 2 26 9
8 2025-01-01 00:00:00 analyst 2 185 8
9 2025-01-01 00:00:00 analytics 2 185 32
10 2025-01-01 00:00:00 apply 1 185 8
# ℹ 353 more rows
Creating Nested Data for Statistical Analysis
# A tibble: 89 × 2
word data
<chr> <list>
1 analysis <tibble [5 × 4]>
2 data <tibble [8 × 4]>
3 i’m <tibble [7 × 4]>
4 i’ve <tibble [5 × 4]>
5 real <tibble [7 × 4]>
6 science <tibble [4 × 4]>
7 share <tibble [4 × 4]>
8 analyst <tibble [3 × 4]>
9 analytics <tibble [7 × 4]>
10 apply <tibble [4 × 4]>
# ℹ 79 more rows
Fitting Logistic Regression Models
I’ll fit a logistic regression model for each word to test whether it became more or less common over time. A positive slope means usage is increasing; a negative slope means it is decreasing.
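Each per-word model treats the word’s monthly count out of that month’s total tokens as binomial successes; `nested_words` is the nested table above (name illustrative):

```r
library(dplyr)
library(purrr)

# glm() accepts a POSIXct predictor (treated numerically), so time_floor
# can be used directly as the trend term.
nested_models <- nested_words %>%
  mutate(models = map(data, ~ glm(
    cbind(count, time_total - count) ~ time_floor,
    data = .x, family = "binomial")))
```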
Extracting Slopes and Statistical Significance
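The slopes and their corrected p-values can be pulled out with broom; `nested_models` is the per-word model table from the previous step (name illustrative), and the correction method is an assumption:

```r
library(broom)
library(dplyr)
library(purrr)
library(tidyr)

slopes <- nested_models %>%
  mutate(models = map(models, tidy)) %>%
  unnest(models) %>%
  filter(term == "time_floor") %>%                      # keep the trend term
  mutate(adjusted.p.value = p.adjust(p.value, "holm")) %>%
  select(word, data, term, estimate, std.error, adjusted.p.value)

# Words whose trend survives the multiple-testing correction
top_slopes <- filter(slopes, adjusted.p.value < 0.05)
```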
Identifying Significant Word Changes
# A tibble: 0 × 6
# ℹ 6 variables: word <chr>, data <list>, term <chr>, estimate <dbl>,
# std.error <dbl>, adjusted.p.value <dbl>
Visualizing Word Usage Changes Over Time
I put every post on a calendar, month by month, and watched the vocabulary move. For each month I counted how often each word appeared, then asked a simple question with a logistic model: is this word gaining ground or losing it as time passes? Because I tested many words, I corrected the p-values for multiple comparisons; after that correction, no individual word’s trend reaches statistical significance (the table above is empty), so the patterns below are descriptive rather than conclusive. Finally I plotted the monthly frequencies so the shifts are easy to see.
The plot shows the arc. “Data” stays the anchor, with clear surges in March and again in July. “Learning” blooms in May, when I leaned into teaching posts. “Analytics” climbs through spring, cools in early summer, then turns up again in August. “Business” drifts downward into the summer months. “Real” holds a steady, lower baseline. Put together, it reads like a move from certificates and announcements to applied, teach-in-public content that my audience sticks with.
Summary of Key Findings
This read of my LinkedIn year shows a clear shift from announcements to original, teach-in-public content. Images, text, and document carousels became the core. Vocabulary moved with that shift. “Data” stayed the anchor, “learning” spiked during teaching months, and “analytics” trended up again in late summer. Engagement follows the same pattern. Carousels and engineering topics travel well, and posts tied to the DE bootcamp or Zach Wilson reach more people, likely due to network effects.
What worked
Teaching carousels with clear takeaways.
Role and craft language, for example data engineering, data quality, time series.
Community touchpoints, mentors, and program references.